As part of Brainnest’s Market Research training, we were given real customer data from Yondo, a service for hosting and selling audiovisual content. We were tasked with finding trends that could help us build a buyer persona. I chose to use a k-modes clustering algorithm to add some quantitative rigor to the qualitative build of the persona. I then filtered the data set for customers within the largest segment and built a buyer persona based on their social media presence.
As we import our data and take a first look, we can see that we are dealing with many variables and a lot of missing values.
library(dplyr) # provides %>%, count(), mutate() and select() used below
df <- read.csv("/Users/matiasfonolla/Desktop/Market Research - Training /Market Research/Yondo.xlsx - QUERY_FOR_YONDOFINAL.csv",
               na.strings = "")
# We have 60 variables and 284 observations
dim(df)
## [1] 284 60
# Names of our variables
names(df)
## [1] "StoreName" "Age"
## [3] "Gender" "CountryName"
## [5] "City" "State"
## [7] "TimeZone" "SignUpType"
## [9] "SubscriptionStatus" "PlanName"
## [11] "Industry" "AlreadySelling"
## [13] "Revenue" "CreatedDate"
## [15] "subdomain" "StoreUrl"
## [17] "dasboard.link" "F20"
## [19] "Is.it.connected.to.their.website" "Are.they.active"
## [21] "Notes" "Household.Income"
## [23] "Marital.Status" "Home.Owner.Status"
## [25] "Length.of.Residence" "Education"
## [27] "Occupation" "Health...Wellness"
## [29] "Travel" "Auto.Parts"
## [31] "Kids...Babies" "Nutrition"
## [33] "Home...Garden" "Garden...Patio"
## [35] "Garden.Supplies" "Home.Decor"
## [37] "Home.Improvement" "Kitchen...Dining"
## [39] "Pets...Supplies" "Gift.Buyer"
## [41] "Toys" "Sports...Outdoors"
## [43] "Beauty" "Mens.Clothing"
## [45] "Shoes" "Womens.Clothing"
## [47] "Jewelry" "Electronics"
## [49] "Computers...Software" "Home.Buyer"
## [51] "Cord.Cutter" "Deal.Seeker"
## [53] "Luxury.Shopper" "Big.Spender"
## [55] "Online.Buyer" "PostalCode"
## [57] "entertainment.sites" "Communities"
## [59] "Groups" "Influencers"
# Every single one of our observations has at least one missing value
df %>%
  count(!complete.cases(.))
## !complete.cases(.) n
## 1 TRUE 284
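Knowing that every row is incomplete does not say which columns are responsible. A minimal sketch of a per-column check follows, using a small made-up data frame (`toy` and its values are placeholders, not the Yondo data):

```r
# Toy data frame standing in for the real one (values are made up)
toy <- data.frame(
  Age    = c("25-34", NA, NA, "45-54"),
  Gender = c("Female", "Male", NA, "Female"),
  Plan   = c("Starter", "Starter", "Trial", "Professional")
)

# Share of missing values per column, sorted from most to least incomplete
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(na_share)

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Sorting the shares makes it easy to spot columns that are too sparse to be useful for segmentation.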
We have a lot of variables that are incomplete and don’t offer
clear insights into potential customer segments. Let’s recode the relevant
ones as factors and take a deeper look.
df <- df %>%
  mutate(Age = as.factor(Age),
         Gender = as.factor(Gender),
         Industry = as.factor(Industry),
         CountryName = as.factor(CountryName),
         SignUpType = as.factor(SignUpType),
         SubscriptionStatus = as.factor(SubscriptionStatus),
         PlanName = as.factor(PlanName),
         AlreadySelling = as.factor(AlreadySelling),
         Revenue = as.factor(Revenue),
         Are.they.active = as.factor(Are.they.active))
We will recode overlapping categories in key variables.
# Re-coding overlapping age categories
unique(df$Age)
## [1] 25-34 45-54 <NA> 35-44 65+ 55-64 54-65 34-45 44-55 45-55
## Levels: 25-34 34-45 35-44 44-55 45-54 45-55 54-65 55-64 65+
df$Age <- replace(df$Age, df$Age == "45-55", "45-54")
# Note: "44-45" is not among the levels printed above, so this line changes nothing
df$Age <- replace(df$Age, df$Age == "44-45", "45-54")
df$Age <- replace(df$Age, df$Age == "44-55", "45-54")
df$Age <- replace(df$Age, df$Age == "34-45", "35-44")
df$Age <- replace(df$Age, df$Age == "54-65", "55-64")
# Re-coding invalid PlanName and CountryName values as NA
df$PlanName <- replace(df$PlanName, df$PlanName == "/", NA)
df$CountryName <- replace(df$CountryName, df$CountryName == "0", NA)
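An alternative way to merge overlapping brackets is to rename the factor levels themselves. This is only a sketch on a made-up factor: `age` and the `fixes` lookup are hypothetical names, not part of the original analysis.

```r
# Toy factor with overlapping age brackets (values are made up)
age <- factor(c("45-54", "45-55", "44-55", "35-44", "34-45", NA, "55-64", "54-65"))

# Hypothetical lookup: each mislabelled bracket and the bracket it should map to
fixes <- c("45-55" = "45-54", "44-55" = "45-54",
           "34-45" = "35-44", "54-65" = "55-64")

# Renaming a level to an already existing level name merges the two levels
hit <- levels(age) %in% names(fixes)
levels(age)[hit] <- fixes[levels(age)[hit]]

print(table(age, useNA = "ifany"))
```

Because the levels are merged rather than overwritten value by value, the mislabelled brackets do not linger as empty categories in later summaries.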
Let’s summarize our most relevant variables.
df %>%
  select(where(is.factor)) %>%
  summary.data.frame()
## Age Gender CountryName SignUpType
## 35-44 : 34 Female:146 United States of America:184 New Signup :274
## 45-54 : 31 Male :105 United Kingdom : 19 Weebly : 7
## 25-34 : 17 NA's : 33 Australia : 18 Weebly Upgrade: 3
## 55-64 : 14 Canada : 15
## 65+ : 10 Switzerland : 6
## (Other): 0 (Other) : 41
## NA's :178 NA's : 1
## SubscriptionStatus PlanName
## active:284 Starter :150
## Professional : 68
## Starter Plus : 47
## Webinar - Starter: 7
## Trial : 6
## (Other) : 5
## NA's : 1
## Industry
## Arts & Crafts : 5
## Consulting : 53
## Fitness :126
## Medical : 28
## Other : 51
## Tutoring (Languages, Math etc.): 14
## NA's : 7
## AlreadySelling Revenue
## I am already selling online using a different system: 71 0 :116
## I haven't yet started selling :135 0-5k : 32
## I'm already selling, just not online : 53 1m+ : 10
## I'm just playing around : 18 250k-1m : 25
## NA's : 7 50k-250k: 43
## 5k-50k : 51
## NA's : 7
## Are.they.active
## No : 81
## Yes :201
## NA's: 2
##
##
##
##
We already have some valuable insights! The most common age groups (besides NA, which we should be mindful of) are 35-44 and 45-54. The sample seems to be mostly American and female. The most popular industry is fitness, and although most accounts are active, the most common answers are that the owner has not yet started selling and has no revenue.
Let’s confirm some of these intuitions graphically.
It seems that a majority of the sample is in fact made up of women in the fitness industry. They have an active account but have not yet started selling online or made any revenue from their business.
We will now seek to confirm this intuition statistically through k-modes clustering.
K-means uses a mathematical measure of distance (between each data point and the mean of its cluster) to group continuous data. The smaller the distance, the more similar our data points are. However, distance measures are not truly meaningful for categorical data (the distance between our ‘0’ and ‘1’ dummy variables is always 1).
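To make the point about dummy variables concrete, here is a small sketch with a made-up `industry` factor. After one-hot encoding, every pair of different categories ends up the same Euclidean distance apart, so the distance says nothing about which categories are "closer".

```r
# Made-up categorical variable with three distinct values
industry <- factor(c("Fitness", "Medical", "Consulting"))

# One-hot (dummy) encoding: one 0/1 column per category
dummies <- model.matrix(~ industry - 1)

# Euclidean distances between the three encoded rows
pairwise <- as.vector(dist(dummies))
print(pairwise)
```

All three pairwise distances are identical, which is why a mean-based method has little to work with here.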
In k-modes, the data is represented as a set of categorical variables, and the algorithm attempts to partition the data into k clusters based on the modes (most frequent values) of the categorical variables in each cluster. In other words, k-modes defines clusters based on the similarities of the categorical variable patterns in the data (Goyal & Agarwal, 2017).
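The idea can be sketched as a single pass of assignment and update on a tiny made-up data set. This is an illustration of the general technique, not klaR's actual implementation; `toy`, `mismatches` and `col_mode` are hypothetical names, and the starting modes are picked by hand.

```r
# Toy categorical data (values are made up for illustration)
toy <- data.frame(
  Gender   = c("Female", "Female", "Male", "Male", "Female"),
  Industry = c("Fitness", "Fitness", "Other", "Other", "Fitness"),
  Plan     = c("Starter", "Starter", "Starter Plus", "Starter", "Professional"),
  stringsAsFactors = FALSE
)

# Simple-matching dissimilarity: number of variables on which two records differ
mismatches <- function(a, b) sum(a != b)

# Most frequent value of a vector (ties go to the first value in table order)
col_mode <- function(x) names(which.max(table(x)))

# Start from two hand-picked modes: rows 1 and 3
modes <- toy[c(1, 3), ]

# Assignment step: each record joins the mode it mismatches least
d <- sapply(seq_len(nrow(modes)), function(k)
  apply(toy, 1, mismatches, b = unlist(modes[k, ])))
cluster <- apply(d, 1, which.min)

# Update step: recompute each cluster's mode column by column
new_modes <- do.call(rbind, lapply(split(toy, cluster), function(g)
  as.data.frame(lapply(g, col_mode), stringsAsFactors = FALSE)))

print(cluster)
print(new_modes)
```

A full run would repeat the two steps until the assignments stop changing.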
We will only use our most relevant variables so as not to add unnecessary noise to our results.
condensed_df <- df %>%
  select(Age, Gender, CountryName,
         PlanName, Industry, AlreadySelling, Revenue, Are.they.active)
K-modes clustering will not work if we feed it NA values. We will work around this by turning them into a string.
library(gtools)
dfwNA <- condensed_df %>%
  mutate_if(is.factor, as.character)
dfwNA <- gtools::na.replace(dfwNA, replace = "NA")
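The same transformation can be sketched in base R on a made-up data frame (`toy` is a placeholder). One consequence is worth keeping in mind: once missing values become the string "NA", they behave like any other category, so two customers with an unknown age count as matching on age.

```r
# Toy data frame with missing values in factor columns (values are made up)
toy <- data.frame(Age    = factor(c("25-34", NA, "45-54")),
                  Gender = factor(c("Female", "Male", NA)))

# Convert every column to character, keeping the data frame structure
toy[] <- lapply(toy, as.character)

# Label missing entries explicitly so they become an ordinary category
toy[is.na(toy)] <- "NA"

print(toy)
```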
Clustering algorithms like k-means and k-modes need a pre-specified number of clusters to run. We’ll estimate the ideal number of clusters by gradually increasing the number of clusters (modes) with a loop and comparing their fit.
library(klaR)
set.seed(1222)
Es <- numeric(10)
for (i in 1:10) {
  kpres <- kmodes(dfwNA, modes = i, iter.max = 15, fast = TRUE)
  # withindiff holds one distance per cluster, so sum them to get the total
  Es[i] <- sum(kpres$withindiff)
}
plot(1:10, Es, type = "b", ylab = "Within Cluster distance", xlab = "# Clusters",
     main = "Scree Plot") # figure 2
The lower the within-cluster simple-matching distance (y-axis), the more
compact and similar the data points within each cluster are. The distance
generally keeps falling as we add clusters, so we look for the point where
the gains level off: at 4 clusters, we get most of the reduction in internal
distance with the fewest clusters. This is known as the ‘elbow’ method.
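Reading the elbow off a plot is a judgment call. As a rough aid, one could look at where the curve bends most sharply. The sketch below uses made-up totals (`Es_demo` is hypothetical and is not the output of this analysis), and the second-difference rule is only one simple heuristic among several.

```r
# Hypothetical total within-cluster distances for k = 1..8 (made-up numbers)
Es_demo <- c(1200, 1000, 820, 650, 630, 615, 605, 600)

# Gain from each extra cluster: gain[j] is the drop when going from j to j + 1
gain <- -diff(Es_demo)
print(gain)

# Second differences measure how sharply the curve bends;
# entry j is centred on k = j + 1
bend <- diff(Es_demo, differences = 2)
elbow_k <- which.max(bend) + 1
print(elbow_k)
```

In this toy series the gains collapse after the fourth cluster, which is the pattern the elbow method looks for.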
We can now run our algorithm, specifying that the end result needs to
return 4 clusters.
mode_clusterswNA <- kmodes(dfwNA, modes = 4, iter.max = 15, fast = TRUE)
clusterswNA <- mode_clusterswNA$modes
clusterswNA <- clusterswNA %>%
  mutate(Size = mode_clusterswNA$size)
| Age | Gender | CountryName | PlanName | Industry | AlreadySelling | Revenue | Are.they.active | Size |
|---|---|---|---|---|---|---|---|---|
| NA | Female | United States of America | Professional | Fitness | I am already selling online using a different system | 50k-250k | Yes | 64 |
| NA | Male | United States of America | Starter | Fitness | I haven’t yet started selling | 0 | Yes | 58 |
| NA | Female | United States of America | Starter | Fitness | I haven’t yet started selling | 0 | Yes | 103 |
| NA | Male | United States of America | Starter Plus | Other | I haven’t yet started selling | 0 | No | 59 |
We can see that the largest cluster (103 customers) is made up of American women in the fitness industry. They have active Starter accounts but have not yet started selling their content online. Although we don’t have information about their age, this seems to confirm our intuitions about the characteristics of Yondo’s largest customer segment.
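To work with that segment further, the rows belonging to the largest cluster need to be pulled out. A fitted `kmodes` object from klaR carries a `cluster` vector with one label per row; in the sketch below a made-up vector stands in for it, and `customers` is a toy table rather than the real data.

```r
# Toy customer table and made-up cluster labels, one label per row
customers <- data.frame(
  StoreName = c("a", "b", "c", "d", "e", "f"),
  Gender    = c("Female", "Male", "Female", "Female", "Male", "Female")
)
cluster <- c(3, 2, 3, 1, 2, 3)

# Identify the most populated cluster and keep only its members
sizes   <- table(cluster)
largest <- as.integer(names(which.max(sizes)))
segment <- customers[cluster == largest, ]

print(segment)
```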
Our original Excel file contained links to customers’ social media
(which I’ve since deleted from the repository file). Filtering for
customers who belong to our cluster of choice, I built a persona based
on this segment’s profile as reflected by their social media presence.
Meet Jill!